This is my RMD-output site for course exercises in the Open Data Science MOOC course. My repository for this course can be found here.
I'm new to version control and GitHub, so let's see if this becomes a habit. :)
Done:

* Registered on GitHub, installed it and pushed my first files
* Committed and pushed updated files to Git
This data consists of seven variables and 166 observations. The data were collected in 2014 (Vehkalahti, 2014) from students attending a statistics course in the social sciences. The survey assessed attitudes towards statistics, individual learning styles and the points each student achieved in the course. In the data used here, the items have been aggregated into scales (Attitude, deep, stra, surf) and divided by the number of items in each scale. Further, rows where Points = 0 were removed from the dataset (original N = 183; after removing zeros, N = 166). Find my R file for data wrangling here.
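A minimal sketch of the wrangling described above; the raw data-frame name `lrn_raw` and the item vectors are placeholders of mine, not the actual column names (those are in the linked R file):

```r
library(dplyr)

# Placeholder vectors naming the questionnaire items in each scale
# (hypothetical item names, for illustration only).
deep_items <- c("D03", "D11", "D19", "D27")
stra_items <- c("ST01", "ST09")
surf_items <- c("SU02", "SU10")

lrn <- lrn_raw %>%
  # average the items in each scale (sum divided by the number of items)
  mutate(deep = rowMeans(across(all_of(deep_items))),
         stra = rowMeans(across(all_of(stra_items))),
         surf = rowMeans(across(all_of(surf_items)))) %>%
  # drop students with zero course points (N: 183 -> 166)
  filter(Points > 0)
```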
More information on the dataset at hand can be found here
Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183 Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014 - in Finnish), international survey of Approaches to Learning, made possible by Teachers’ Academy funding for KV in 2013-2015.
setwd("C:/Users/rekar/Documents/Road to PhD/Kurssit/Open data science/IODS-project/data")
lrn <- read.csv("lrn.csv")
library(tidyverse)
library(viridis)
#Rename gender variable for plotting purposes
lrn <-
lrn %>%
mutate(gender=recode(gender, "M"="Male", "F"="Female"))
head(lrn)
## gender Age Attitude deep stra surf Points
## 1 Female 53 3.7 3.583333 3.375 2.583333 25
## 2 Male 55 3.1 2.916667 2.750 3.166667 12
## 3 Female 49 2.5 3.500000 3.625 2.250000 24
## 4 Male 53 3.5 3.500000 3.125 2.250000 10
## 5 Male 49 3.7 3.666667 3.625 2.833333 22
## 6 Female 38 3.8 4.750000 3.625 2.416667 21
Below you can find a summary of the data produced by describe from the psych library. After this I examined the data visually using ggpairs. I then drew two plots to show the distributions more clearly and further ran a few geom_smooths for fun. I tried to add the viridis colouring to the ggpairs plot, but it was not applied to all of the panels. If you, dear student peer reviewer, have a solution to this, please comment it in the peer review.
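One possible fix, offered as an untested guess: the ggpairs panels map both the colour and the fill aesthetics, so supplying the matching fill scale as well might colour the remaining panels:

```r
# Add both discrete viridis scales; bar and density panels use fill,
# not colour, so a colour scale alone leaves them at the defaults.
ggpairs(lrn, mapping = aes(col = gender, alpha = 0.2),
        lower = list(combo = wrap("facethist", bins = 20))) +
  scale_colour_viridis_d(option = "H") +
  scale_fill_viridis_d(option = "H")
```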
First we take a look at the distribution of age by gender, and we see that the males attending this course were older than the females. Second we look at the distribution of attitudes towards statistics, where we find both the mean (marked with an asterisk) and the median to be higher among males. Third, I was interested in whether age had any association with the points acquired, for which a loess smooth was fitted to the data.
After this I plotted a few scatter plots with geom_smooths (OLS regressions) to assess relationships among the variables. We find a clear relation between attitude and points earned in the course. Next, the relation of attitude to learning styles was examined; visually this relationship seems to be present only among males.
Further, the relation of learning strategies to acquired course points was assessed. Here we find an association between learning strategies and points earned, with higher learning-strategy scores associated with lower course points. However, the CI bands are quite wide.
#Using the psych library for extensive summary
library(psych)
library(GGally)
describe(lrn)
## vars n mean sd median trimmed mad min max range skew
## gender* 1 166 1.34 0.47 1.00 1.30 0.00 1.00 2.00 1.00 0.68
## Age 2 166 25.51 7.77 22.00 23.99 2.97 17.00 55.00 38.00 1.89
## Attitude 3 166 3.14 0.73 3.20 3.15 0.74 1.40 5.00 3.60 -0.08
## deep 4 166 3.68 0.55 3.67 3.70 0.62 1.58 4.92 3.33 -0.50
## stra 5 166 3.12 0.77 3.19 3.14 0.83 1.25 5.00 3.75 -0.11
## surf 6 166 2.79 0.53 2.83 2.78 0.62 1.58 4.33 2.75 0.14
## Points 7 166 22.72 5.89 23.00 23.08 5.93 7.00 33.00 26.00 -0.40
## kurtosis se
## gender* -1.54 0.04
## Age 3.24 0.60
## Attitude -0.48 0.06
## deep 0.66 0.04
## stra -0.45 0.06
## surf -0.27 0.04
## Points -0.26 0.46
ggpairs(lrn, mapping = aes(col=gender, alpha=0.2), lower=list(combo = wrap("facethist", bins =20))) + scale_colour_viridis_d(option = "H")
#Age * sex
lrn %>%
ggplot(aes(gender, Age, colour=gender)) +
geom_violin() +
scale_colour_viridis_d(option="H") +
geom_boxplot(width=0.2, size=0.2, color="black", alpha=0.4, outlier.size = 0) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
theme_bw()
#Attitude * sex
lrn %>%
ggplot(aes(gender, Attitude, color=gender)) +
geom_violin() +
scale_colour_viridis_d(option="H") +
geom_boxplot(width=0.2, size=0.2, color="black", alpha=0.4, outlier.size = 0) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
theme_bw()
#Age's relation to points
lrn %>%
ggplot(aes(Age, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method="loess") +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * points
lrn %>%
ggplot(aes(Attitude, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * learning style deep
lrn %>%
ggplot(aes(Attitude, deep, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * learning strategy
lrn %>%
ggplot(aes(Attitude, stra, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Attitude * surface learning style
lrn %>%
ggplot(aes(Attitude, surf, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
#Learning style deep * points
lrn %>%
ggplot(aes(deep, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
lrn %>%
ggplot(aes(stra, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
lrn %>%
ggplot(aes(surf, Points, color=gender)) +
geom_point(alpha=0.2) +
geom_smooth(method=lm) +
scale_colour_viridis_d(option="H") +
theme_bw()
For this exercise we were asked to fit a model with three IVs for the relationship to points earned (the DV of the model), and to remove non-significant IVs from the final model.
I will fit the model (OLS regression) with attitude, learning strategies (stra) and surface learning (surf) as IVs, as associations of these were observed in the visualisations above.
Below we can see the summaries of lfit (with all three IVs mentioned above) and lfit2 (attitude as the only IV). We find that attitude has quite a strong association with the points acquired (B = 3.40, p < .001). Surface learning is associated negatively with the outcome and learning strategies positively, but both are far from significant; neither was significantly associated with Points as a single predictor either. The multiple R-squared is about .02 higher in lfit, so the added predictors account for only a tiny additional share of the variance. Taking into account their non-significance and that the adjusted R-squared changes by less than .01, we can be content with attitude alone explaining about 19% of the variance in acquired course points.
lfit <- lm(Points ~ Attitude + surf + stra, data=lrn)
summary(lfit)
##
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = lrn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#removing non-significant IVs
lfit2 <- lm(Points ~ Attitude, data=lrn)
summary(lfit2)
##
## Call:
## lm(formula = Points ~ Attitude, data = lrn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
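The informal R-squared comparison above can also be made formal with a nested-model F test; a quick sketch using the two models fitted above:

```r
# Does adding surf and stra significantly improve on attitude alone?
anova(lfit2, lfit)
```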
The residuals-vs-fitted plot shows that the assumption of a linear association is reasonable. The Q-Q plot supports the normality assumption, although a few outliers emerge. According to these diagnostics, the assumptions of the linear model Points ~ Attitude are reasonably valid.
par(mfrow = c(2,2))
plot(lfit2, which=c(1,2,5))
setwd("C:/Users/rekar/Documents/Road to PhD/Kurssit/Open data science/IODS-project/data")
pormath <- read.csv("pormath.csv")
library(dplyr)
library(tidyverse)
library(GGally)
library(viridis)
library(sjPlot)
library(cowplot)
In this exercise, the data assessed are retrieved from the UCI Machine Learning Repository, collected by Paulo Cortez. They comprise two datasets of secondary-school students from Portuguese schools, combining information from school reports and questionnaires. More information about the dataset used in this exercise can be found here. The two separate datasets concern performance in mathematics and Portuguese, respectively. Only participants present in both datasets are included in the combined data used in this exercise; the others were excluded. The total N of observations is thus 370.
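A hedged sketch of the combination step described above; the data-frame names `math` and `por` and the exact set of join columns are assumptions on my part (the actual wrangling lives in a separate script):

```r
library(dplyr)

# Background columns assumed to identify the same student in both datasets.
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")

# Keep only students present in both datasets (inner join), suffixing
# the course-specific columns for maths (.m) and Portuguese (.p).
pormath <- inner_join(math, por, by = join_cols, suffix = c(".m", ".p"))
```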
colnames(pormath)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
The association of the following variables with high alcohol consumption will be assessed. High alcohol consumption is defined here as alc_use > 2, where alc_use is the mean of alcohol consumption during weekdays (Dalc) and during weekends (Walc), each rated from 1 (very low) to 5 (very high).
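The cut-off described above can be sketched as follows (the computation itself lives in the data-wrangling script; only the Dalc and Walc columns from the combined data are assumed):

```r
library(dplyr)

pormath <- pormath %>%
  mutate(alc_use = (Dalc + Walc) / 2,   # mean of weekday and weekend use
         high_use = alc_use > 2)        # TRUE = high alcohol consumption
```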
In the following arguments for choosing these variables, I won't be citing any research, since that is beyond the scope of the current exercise. However, to present working hypotheses, I briefly argue why I expect certain outcomes. Furthermore, I have not familiarised myself with the socio-cultural or contextual factors that should be taken into account when assessing the following factors.
First, parental cohabitation (Pstatus) is an interesting variable, since co-parenting is often associated with better all-round well-being in children and adolescents; I expect cohabiting parenthood to be associated with lower alcohol consumption. Second, I find it interesting to assess whether attending nursery school has an association with alcohol consumption. As I am not familiar with the Portuguese nursery system, I can only take a wild guess and expect that those who attended nursery school (i.e., pre-school) were better prepared to start primary school, and therefore experienced fewer hardships at the very start of their educational path. Third, I assess the association of the number of past class failures with high alcohol consumption. I assume that some adolescents, lacking the social or emotional/instrumental resources to cope with failures, may develop self-handicapping strategies, which can then turn into a vicious cycle; hence, the higher the number of failures, the higher the probability of high alcohol consumption. Finally, I expect perceived familial relations (famrel) also to be associated with the alcohol consumption outcome: the better the familial relations, the lower the probability of high alcohol consumption.
###Select the variables to a new DF####
pormath_s <- pormath %>%
select(Pstatus, nursery, failures, famrel, sex, age, high_use)
### Change logical high_use to numerical ####
#pormath_s$high_use <- as.numeric(pormath_s$high_use) # 1 = True, 0 = False
### Distributions with the discrete variable (alc_use) ####
p1 <- (pormath %>%
ggplot(aes(Pstatus, alc_use)) +
see::geom_violinhalf() +
geom_boxplot(width=0.2, size=0.2, alpha=0.4) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Parents Apart (A) or Together (T)?")
)
p2 <- (pormath %>%
ggplot(aes(nursery, alc_use)) +
see::geom_violinhalf() +
geom_boxplot(width=0.2, size=0.2, alpha=0.4) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Attended nursery school?")
)
p3 <- (pormath %>%
ggplot(aes(failures, alc_use, color=alc_use)) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Number of past Class failures")
)
p4 <- (pormath %>%
ggplot(aes(famrel, alc_use)) +
stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
theme_bw() +
ylab("Alcohol use") +
xlab("Familial relations \n (1 = very bad, 5 = excellent)")
)
plot_grid(p1, p2, p3, p4)
### Cross-tabulations ####
tab_xtab(var.row = pormath_s$Pstatus, var.col = pormath_s$high_use, title = "Parents Cohabiting?", show.row.prc = TRUE)
| Pstatus | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| A | 26 (68.4 %) | 12 (31.6 %) | 38 (100 %) |
| T | 233 (70.2 %) | 99 (29.8 %) | 332 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ² = 0.001 · df = 1 · φ = 0.012 · p = 0.970
tab_xtab(var.row = pormath_s$nursery, var.col = pormath_s$high_use, title = "Attended to Nursery School?", show.row.prc = TRUE)
| nursery | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| no | 46 (63.9 %) | 26 (36.1 %) | 72 (100 %) |
| yes | 213 (71.5 %) | 85 (28.5 %) | 298 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ² = 1.249 · df = 1 · φ = 0.066 · p = 0.264
tab_xtab(var.row = pormath_s$failures, var.col = pormath_s$high_use, title = "Number of past Class failures", show.row.prc = TRUE)
| failures | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| 0 | 238 (73.2 %) | 87 (26.8 %) | 325 (100 %) |
| 1 | 12 (50 %) | 12 (50 %) | 24 (100 %) |
| 2 | 8 (47.1 %) | 9 (52.9 %) | 17 (100 %) |
| 3 | 1 (25 %) | 3 (75 %) | 4 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ² = 14.304 · df = 3 · Cramér's V = 0.197 · Fisher's p = 0.003
tab_xtab(var.row = pormath_s$famrel, var.col = pormath_s$high_use, title = "Familial relations", show.row.prc = TRUE)
| famrel | high_use: FALSE | high_use: TRUE | Total |
|---|---|---|---|
| 1 | 6 (75 %) | 2 (25 %) | 8 (100 %) |
| 2 | 9 (50 %) | 9 (50 %) | 18 (100 %) |
| 3 | 39 (60.9 %) | 25 (39.1 %) | 64 (100 %) |
| 4 | 128 (71.1 %) | 52 (28.9 %) | 180 (100 %) |
| 5 | 77 (77 %) | 23 (23 %) | 100 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ² = 8.466 · df = 4 · Cramér's V = 0.151 · Fisher's p = 0.068
Of these expectations, only two variables had a statistically significant relationship with high alcohol consumption: the number of past class failures predicted it positively (95% CI for OR = 1.31 to 2.86) and familial relations negatively (95% CI for OR = 0.60 to 0.98). The other two variables had no statistically significant relationship with high alcohol consumption.
m1 <- glm(high_use ~ Pstatus + nursery + failures + famrel, family="binomial", data=pormath_s)
jtools::summ(m1)
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(4) = 18.16, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.07
## Pseudo-R² (McFadden) = 0.04
## AIC = 443.88, BIC = 463.45
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) 0.39 0.64 0.61 0.54
## PstatusT -0.10 0.38 -0.27 0.79
## nurseryyes -0.31 0.29 -1.08 0.28
## failures 0.65 0.20 3.29 0.00
## famrel -0.27 0.12 -2.14 0.03
## ------------------------------------------------
OR1 <- coef(m1) %>% exp
CI1 <- confint(m1) %>% exp
cbind(OR1, CI1)
## OR1 2.5 % 97.5 %
## (Intercept) 1.4764105 0.4176121 5.1647164
## PstatusT 0.9038712 0.4366755 1.9686682
## nurseryyes 0.7345732 0.4215456 1.3009277
## failures 1.9163856 1.3101788 2.8641095
## famrel 0.7654375 0.5985101 0.9775722
Here I further examine the predictive power of the model fitted above (m1). The predictive power is fair, as it predicted about 30% of the outcomes wrong.
# predict() the probability of high_use
probabilities <- predict(m1, type = "response")
# add the predicted probabilities to 'alc'
pormath_s <- mutate(pormath_s, probability = probabilities)
# use the probabilities to make a prediction of high_use
pormath_s <- mutate(pormath_s, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = pormath_s$high_use, prediction = pormath_s$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 99 12
ggplot(pormath_s, aes(probability, high_use, col=prediction)) +
geom_point()
table(high_use = pormath_s$high_use, prediction = pormath_s$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67027027 0.02972973 0.70000000
## TRUE 0.26756757 0.03243243 0.30000000
## Sum 0.93783784 0.06216216 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = pormath_s$high_use, prob = pormath_s$probability)
## [1] 0.2972973
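To see what the loss function is doing, here is a tiny self-contained example: logical values coerce to 1/0, so `abs(class - prob) > 0.5` flags exactly the predictions that fall on the wrong side of the 0.5 threshold.

```r
class <- c(TRUE, FALSE, TRUE)
prob  <- c(0.9, 0.2, 0.3)
# |1 - 0.9| = 0.1 and |0 - 0.2| = 0.2 are fine; |1 - 0.3| = 0.7 is a miss
mean(abs(class - prob) > 0.5)  # 1 wrong out of 3 -> 0.3333333
```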
The ten-fold cross-validation shows predictive performance on the held-out folds (about 30% wrong) close to that obtained by testing against the whole training data (also about 30% wrong).
loss_func(class = pormath_s$high_use, prob = pormath_s$probability)
## [1] 0.2972973
cv <- boot::cv.glm(data = pormath_s, cost = loss_func, glmfit = m1, K = 10)
cv$delta
## [1] 0.3027027 0.3024324
Here I compare two further models to the m1 tested above. Before evaluating predictive power further, I examine these models by their BIC and pseudo-R² changes. We find the third model below to be the best of the three according to both measures.
Further, the predictive power of m3 (cross-validated error rate of about 22%) exceeds that of the first model tested above (m1, about 30% error rate), and moreover exceeds the predictive power of the model introduced in the course DataCamp exercises (about 26% error rate).
According to this model (m3, last summary below), males had a higher probability of high alcohol consumption than their female counterparts (95% CI for OR = 1.66 to 4.76). Second, adolescents who went out more were also more likely to exceed our cut-off for high alcohol consumption (95% CI for OR = 1.68 to 2.77). Third, adolescents living in urban areas were less likely to do so than their counterparts in rural areas (95% CI for OR = 0.28 to 0.97). Fourth, adolescents who had failed classes were more likely to do so (95% CI for OR = 0.95 to 2.37), although it is worth noting that this interval includes 1, and that predicting alcohol consumption from experienced failures in this data rests on very few observations (see the class counts in the cross-tabulations above). Fifth, adolescents with good familial relations were less likely to do so (95% CI for OR = 0.50 to 0.89). Finally, adolescents with more absences were also more likely to consume more alcohol, although the OR was quite modest (95% CI for OR = 1.04 to 1.13).
jtools::summ(m1)
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(4) = 18.16, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.07
## Pseudo-R² (McFadden) = 0.04
## AIC = 443.88, BIC = 463.45
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) 0.39 0.64 0.61 0.54
## PstatusT -0.10 0.38 -0.27 0.79
## nurseryyes -0.31 0.29 -1.08 0.28
## failures 0.65 0.20 3.29 0.00
## famrel -0.27 0.12 -2.14 0.03
## ------------------------------------------------
jtools::summ(glm(high_use ~ failures + sex + address + famrel + freetime + goout + internet, family="binomial", data=pormath)) #freetime and internet access not significant; removing them.
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(7) = 83.08, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.29
## Pseudo-R² (McFadden) = 0.18
## AIC = 384.96, BIC = 416.27
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) -2.23 0.77 -2.89 0.00
## failures 0.45 0.23 1.99 0.05
## sexM 0.89 0.26 3.38 0.00
## addressU -0.70 0.32 -2.22 0.03
## famrel -0.43 0.14 -3.00 0.00
## freetime 0.11 0.14 0.75 0.46
## goout 0.76 0.13 5.90 0.00
## internetyes 0.26 0.38 0.69 0.49
## ------------------------------------------------
jtools::summ(glm(high_use ~ failures + sex + address + famrel + goout + absences, family="binomial", data=pormath))
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(6) = 94.64, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.32
## Pseudo-R² (McFadden) = 0.21
## AIC = 371.39, BIC = 398.79
##
## Standard errors: MLE
## ------------------------------------------------
## Est. S.E. z val. p
## ----------------- ------- ------ -------- ------
## (Intercept) -2.27 0.71 -3.22 0.00
## failures 0.40 0.23 1.73 0.08
## sexM 1.03 0.27 3.82 0.00
## addressU -0.65 0.31 -2.07 0.04
## famrel -0.40 0.15 -2.79 0.01
## goout 0.76 0.13 6.02 0.00
## absences 0.08 0.02 3.54 0.00
## ------------------------------------------------
m3 <- glm(high_use ~ failures + sex + address + famrel + goout + absences, family="binomial", data=pormath)
exp(cbind(OR = coef(m3), confint(m3)))
## OR 2.5 % 97.5 %
## (Intercept) 0.1030586 0.0248787 0.4001095
## failures 1.4913178 0.9538249 2.3728914
## sexM 2.7874285 1.6580283 4.7602275
## addressU 0.5238851 0.2836255 0.9691226
## famrel 0.6672113 0.5004071 0.8856192
## goout 2.1448371 1.6844140 2.7725681
## absences 1.0816926 1.0365062 1.1324925
# predict() the probability of high_use
probabilities3 <- predict(m3, type = "response")
# add the predicted probabilities
pormath <- mutate(pormath, probability = probabilities3)
# use the probabilities to make a prediction of high_use
pormath <- mutate(pormath, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = pormath$high_use, prediction = pormath$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 56 55
ggplot(pormath, aes(probability, high_use, col=prediction)) +
geom_point()
table(high_use = pormath$high_use, prediction = pormath$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64054054 0.05945946 0.70000000
## TRUE 0.15135135 0.14864865 0.30000000
## Sum 0.79189189 0.20810811 1.00000000
# reuse loss_func (defined earlier) to compute the share of wrong
# predictions in the training data
loss_func(class = pormath$high_use, prob = pormath$probability)
## [1] 0.2108108
#cross-validate
cv2 <- boot::cv.glm(data = pormath, cost = loss_func, glmfit = m3, K = 10)
#print error-rate of cross-validation
cv2$delta
## [1] 0.2216216 0.2254054
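For a direct side-by-side comparison, both models can be cross-validated with the same loss function; a sketch (with K = 10 the folds are random, so the exact rates vary between runs, and the seed below is only illustrative):

```r
set.seed(123)  # illustrative seed, not from the original analysis
cv_m1 <- boot::cv.glm(data = pormath_s, cost = loss_func, glmfit = m1, K = 10)
cv_m3 <- boot::cv.glm(data = pormath,   cost = loss_func, glmfit = m3, K = 10)
c(m1_error = cv_m1$delta[1], m3_error = cv_m3$delta[1])
```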
In this exercise the data used is the Boston dataset included in the MASS package. It comprises 506 observations of 14 variables. The original rationale of the data was to study how air quality (NOx levels) by area relates to housing prices (see Harrison & Rubinfeld, 1978). The 506 observations are census tracts, not districts as I first supposed from the structure of the data. Details of the variables can be found here
Further, part of the rationale of this exercise is to find an appropriate number of clusters.
Below you will find a summary of the variables, an overview of the data, density plots of the variables and, finally, a correlation plot of the data.
From these figures we can see that accessibility to radial highways (rad) correlates strongly (r = .91) with the full-value property-tax rate (tax). The NOx levels are negatively correlated with distance to the Boston employment centres (r = -.77); in other words, the further from these centres, the lower the NOx levels. Further, it seems that the older the buildings, the higher the NOx levels (r = .73), which I assume might correspond to city-centre areas. Eyeballing these figures, one could argue that the data might form two clusters, corresponding to the shapes, peaks and polarisation seen in the density plots.
library(tidyverse)
library(corrplot)
library(MASS)
library(viridis)
library(GGally)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
Boston %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_density() +
facet_wrap(~var_name, scales="free") + theme_bw()
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(cor_matrix, method="color", type = "lower", cl.pos="b", tl.pos="d", tl.cex=0.7, cl.cex = 0.7, col = viridis(n=100), tl.col = "black" ,addCoef.col=T, number.cex=0.5)
First, the data will be scaled: (x - mean(x)) / sd(x). Then we create a categorical variable from the crim variable with its quantiles as break points. Finally we remove the old crim variable and add the categorised crime variable to the data frame instead.
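As a quick sanity check (base R only), the scaling formula can be verified by hand for one column:

```r
library(MASS)
data("Boston")

# scale() centres each column by its mean and divides by the (n - 1)
# standard deviation, i.e. the same quantity that sd() returns.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- as.data.frame(scale(Boston))
all.equal(manual, scaled$crim)  # TRUE
```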
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Next the data will be divided into training (80%) and test sets.
n <- nrow(boston_scaled)  # 506 observations
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test<-boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
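A minimal sketch of the same 80/20 split on toy data; the set.seed() call is my addition for reproducibility (the split above is unseeded, so its composition changes between runs):

```r
df <- data.frame(y = rnorm(506), x = rnorm(506))  # toy stand-in for boston_scaled
set.seed(42)
n <- nrow(df)
ind <- sample(n, size = round(n * 0.8))  # row indices for the training set
train_toy <- df[ind, ]
test_toy  <- df[-ind, ]
c(nrow(train_toy), nrow(test_toy))  # 405 101
```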
Next I run a linear discriminant analysis (LDA). The model identifies the high crime rate class rather well, although a few med_high observations get misclassified. Predictions in the low crime category are also good, but in the middle categories the classification could be better. rad, i.e., accessibility to radial highways, is the strongest predictor in this model, with the proportion of residential land (zn) and the nitrogen oxide concentration (nox) carrying the next largest coefficients.
# linear discriminant analysis (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2400990 0.2425743 0.2549505
##
## Group means:
## zn indus chas nox rm age
## low 0.94550337 -0.9113049 -0.08661679 -0.8799680 0.4774434 -0.8814792
## med_low -0.09206328 -0.3150893 -0.02879709 -0.5722726 -0.1261929 -0.3847239
## med_high -0.37173443 0.1214695 0.16959035 0.2854773 0.1642232 0.3496570
## high -0.48724019 1.0170891 -0.08120770 1.0891473 -0.4468777 0.8297720
## dis rad tax ptratio black lstat
## low 0.8261448 -0.6882537 -0.7436993 -0.48755665 0.3875502 -0.77903790
## med_low 0.3697808 -0.5437961 -0.4849496 -0.08565135 0.3279776 -0.14675853
## med_high -0.3263681 -0.3900591 -0.2965035 -0.16328022 0.1315624 -0.06035831
## high -0.8397379 1.6384176 1.5142626 0.78111358 -0.7395864 0.89099707
## medv
## low 0.56213606
## med_low -0.02049302
## med_high 0.22054966
## high -0.72667349
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.083996778 0.663842710 -0.87404743
## indus 0.007118397 -0.257469431 0.36162046
## chas -0.087003704 -0.053228386 0.06822034
## nox 0.497021117 -0.804277905 -1.37469582
## rm -0.089507406 -0.073338199 -0.16005020
## age 0.245543799 -0.270059998 -0.16872710
## dis 0.026591819 -0.268187693 0.15876190
## rad 2.903080486 1.140265129 0.09509299
## tax 0.043333306 -0.127764129 0.37883694
## ptratio 0.118130448 -0.142815306 -0.38903761
## black -0.123539774 -0.008512562 0.10116545
## lstat 0.267979096 -0.280132977 0.38803245
## medv 0.215683028 -0.508477333 -0.30261083
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9504 0.0367 0.0130
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "blue", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.2)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 4 2 0
## med_low 7 16 6 0
## med_high 0 6 22 0
## high 0 0 0 24
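From the cross-tabulation, the overall prediction accuracy can be computed as the share of diagonal (correctly classified) cells. The numbers below are copied from this particular random split, so they will differ between runs:

```r
cm <- matrix(c(15,  4,  2,  0,
                7, 16,  6,  0,
                0,  6, 22,  0,
                0,  0,  0, 24),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
round(accuracy, 3)  # 0.755
```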
Next I reload and scale the Boston data set and calculate the distances between the observations.
data("Boston")
boston_scaled_k <- scale(Boston)
boston_scaled_k <- as.data.frame(boston_scaled_k)
summary(dist(boston_scaled_k))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
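dist() defaults to the Euclidean distance; a hand check on two toy points confirms the metric:

```r
# distance between (0, 0) and (3, 4) should be sqrt(3^2 + 4^2) = 5
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))  # 5
```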
Next, the data will be clustered with k-means. The largest drop in the total within-cluster sum of squares appears at two (2) clusters, so a two-cluster solution is examined next.
kmax <- 10
set.seed(123)
totws <- sapply(1:kmax, function(k){kmeans(boston_scaled_k, k)$tot.withinss})
qplot(x = 1:kmax, y=totws, geom='line') + theme_bw() +
ylab("Total within sum of squares")
A two-cluster solution is plotted below. As anticipated, the same variables that stood out in the density plots also separate the two clusters. Non-retail business acres per town (indus) and the property-tax rate (tax) separate very clearly, and NOx levels also fall nicely into the two clusters. The density plots are shown here as well for a scroll-free comparison.
kme <- kmeans(boston_scaled_k, centers = 2)
pairs(boston_scaled_k, col=kme$cluster)
boston_scaled_k %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_density() +
facet_wrap(~var_name, scales="free") + theme_bw()
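As a sanity check of the method (toy data, not the Boston set), k-means with two centers recovers two clearly separated groups; set.seed() is added for reproducibility:

```r
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),  # 20 points around (0, 0)
             matrix(rnorm(40, mean = 8), ncol = 2))  # 20 points around (8, 8)
km <- kmeans(toy, centers = 2)
table(km$cluster)  # two groups of 20 (the cluster labels themselves are arbitrary)
```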
In this exercise, PCA and MCA are practiced with two different data sets.
For the PCA, the data used is the Human Development Index (HDI) data of the UNDP. The data set used in this exercise has observations from 155 countries on the eight variables listed below. Further information on the data set and the variables can be found here.
As the variables used in this exercise have been wrangled further, a short explanation of each variable follows:
* Edu2.FM: ratio of females to males with at least secondary education
* Labo.FM: ratio of females to males participating in the labour force
* Edu.Exp: expected years of schooling
* Life.Exp: life expectancy at birth
* GNI: gross national income per capita
* Mat.Mor: maternal mortality ratio (deaths per 100,000 live births)
* Ado.Birth: adolescent birth rate (births per 1,000 women aged 15-19)
* Parli.F: percentage of female representatives in parliament
h <- read.csv('./data/human.csv', header= T, row.names = 1)
library(GGally)
library(corrplot)
library(tidyverse)
library(dplyr)
library(viridis)
str(h)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
Below you can find summaries and figures of the data at hand. We find that there are large discrepancies in the data set, i.e., large inequality between countries. From the summaries we see that, on average, more males than females get to finish secondary education, and males are also more often in the labour force. Expected years of education vary from 5 to 20 years, with both mean and median around 13 years. Life expectancy ranges from 49 to 83.5 years, and the maternal mortality ratio from about 0.001 % to 1.1 % of live births. The adolescent birth rate varies from about 0.06 % to 20.5 %. There are also parliaments with no female representatives at all, while in others women hold up to 57.5 % of the seats.
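The percentages quoted above are unit conversions of the summary extremes, assuming the standard HDR units (Mat.Mor reported per 100,000 live births, Ado.Birth per 1,000 women):

```r
mat_mor_pct   <- c(1, 1100)    / 100000 * 100  # deaths per 100,000 -> %
ado_birth_pct <- c(0.6, 204.8) / 1000   * 100  # births per 1,000   -> %
mat_mor_pct    # 0.001 1.1
ado_birth_pct  # 0.06 20.48
```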
From the correlations we see that the maternal mortality ratio is related to the adolescent birth rate (r = .76). Expected years of education correlate positively with life expectancy (r = .79) and GNI (r = .62), and negatively with both maternal mortality (r = -.74) and the adolescent birth rate (r = -.70), indicating that long expected education and high life expectancy are properties of countries with higher GNI.
summary(h)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
my_fn <- function(data, mapping, method="p", use="pairwise", ...){
# grab data
x <- eval_data_col(data, mapping$x)
y <- eval_data_col(data, mapping$y)
# calculate correlation
corr <- cor(x, y, method=method, use=use)
# calculate colour based on correlation value
# Here I have set a correlation of minus one to blue,
# zero to white, and one to red
# Change this to suit: possibly extend to add as an argument of `my_fn`
colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]
ggally_cor(data = data, mapping = mapping, ...) +
theme_void() +
theme(panel.background = element_rect(fill=fill))
} # this function was written by user20650 on Stack Overflow:
# https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-values
ggpairs(h,
upper = list(continuous = my_fn), lower=list(combo=wrap("facethist", binwidth=20, size=1)))
h %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_histogram() +
facet_wrap(~var_name, scales = "free_x") +
theme_bw()
First we run PCA on the non-scaled data, which naturally doesn't make much sense, since the scale of GNI is through the roof compared to the other variables (see the graph below): GNI alone appears to explain nearly 100 % of the variance in the data. Therefore it is reasonable to scale the data before doing PCA, which follows.
pca_h <- prcomp(h)
s <- summary(pca_h)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
pca_pr1 <- round(100*s$importance[2,], digits = 1)
pca_pr1
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc_lab <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
biplot(pca_h, choices=1:2, cex=c(0.8,1), col=c("slategray3","royalblue3"), xlab=pc_lab[1], ylab=pc_lab[2])
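The scale problem can be demonstrated on toy data (my own assumed example): one variable with a huge variance swallows the first component of an unscaled PCA, while after scaling the variance spreads across the components.

```r
set.seed(7)
toy <- data.frame(a   = rnorm(100, sd = 1),
                  b   = rnorm(100, sd = 1),
                  big = rnorm(100, sd = 1000))  # dominates unscaled PCA, like GNI
p_raw    <- summary(prcomp(toy))$importance["Proportion of Variance", ]
p_scaled <- summary(prcomp(scale(toy)))$importance["Proportion of Variance", ]
round(p_raw, 3)     # PC1 carries essentially all of the variance
round(p_scaled, 3)  # variance is spread much more evenly
```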
In the biplot below, the arrows represent the original variables and the angles between them reflect their correlations. We see that Edu.Exp, Life.Exp, Edu2.FM and GNI correlate positively with each other, and negatively with Mat.Mor and Ado.Birth. The first component comprises these factors and explains 53.6 % of the variance; the second component comprises the female-male ratios in the labour force and parliament and explains 16.2 % of the variance.
h_scaled <- scale(h)
pca_h_s <- prcomp(h_scaled)
s2 <- summary(pca_h_s)
s2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
pca_pr2 <- round(100*s2$importance[2,], digits = 1)
pca_pr2
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
biplot(pca_h_s, choices=1:2, cex=c(0.8, 1), col=c("slategray3","royalblue3"), xlab=pc_lab2[1], ylab=pc_lab2[2])
The tea data set from the FactoMineR package has 300 observations of 36 variables and concerns tea-drinking preferences. It is worth noting that not much documentation is available for this data set, so some of my assumptions are pure guesses at best.
First we do an MCA on the whole data, excluding the age variable since it is not a factor. As the resulting plot doesn't make much sense, I chose a few variables that I assumed to correspond to distinct dimensions.
I chose the following variables: Sport, diuretic, frequency, where, how, sugar, slimming, relaxing, effect.on.health, feminine, sex and age_Q.
With these factors, the variance explained by dimension 1 is enhanced. On dimension one, the factors correspond to being young or old, a sportsman or not, and to whether the tea is considered diuretic or not; in a sense, a correspondence to lifestyles can be seen here. The second dimension shows a contrast between those who prefer quality or speciality teas over big brands: those who buy their teas unpackaged from tea shops versus those who buy teabags from chain stores.
Further, with the plotellipses function it is possible to draw confidence ellipses. From these we can confirm that the type of tea and the place of purchase form the groups that are most clearly separated from each other.
library(FactoMineR)
data("tea")
str(tea) #drop age, since it is not a factor
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
#ggpairs(tea)
tea_s <- subset(tea, select= -age) #drop age from df
mca <- MCA(tea_s, graph=F)
summary(mca)
##
## Call:
## MCA(X = tea_s, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.090 0.082 0.070 0.063 0.056 0.053 0.050
## % of var. 5.838 5.292 4.551 4.057 3.616 3.465 3.272
## Cumulative % of var. 5.838 11.130 15.681 19.738 23.354 26.819 30.091
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.048 0.047 0.044 0.041 0.040 0.039 0.037
## % of var. 3.090 3.053 2.834 2.643 2.623 2.531 2.388
## Cumulative % of var. 33.181 36.234 39.068 41.711 44.334 46.865 49.252
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.035 0.034 0.032 0.031 0.031 0.030
## % of var. 2.302 2.275 2.172 2.085 2.013 2.011 1.915
## Cumulative % of var. 51.554 53.829 56.000 58.086 60.099 62.110 64.025
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.028 0.027 0.026 0.025 0.025 0.024 0.024
## % of var. 1.847 1.740 1.686 1.638 1.609 1.571 1.524
## Cumulative % of var. 65.872 67.611 69.297 70.935 72.544 74.115 75.639
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.021 0.020 0.020 0.019 0.019
## % of var. 1.459 1.425 1.378 1.322 1.281 1.241 1.222
## Cumulative % of var. 77.099 78.523 79.901 81.223 82.504 83.745 84.967
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.018 0.017 0.017 0.016 0.015 0.015 0.014
## % of var. 1.152 1.092 1.072 1.019 0.993 0.950 0.924
## Cumulative % of var. 86.119 87.211 88.283 89.301 90.294 91.244 92.169
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.014 0.013 0.012 0.011 0.011 0.010 0.010
## % of var. 0.891 0.833 0.792 0.729 0.716 0.666 0.660
## Cumulative % of var. 93.060 93.893 94.684 95.414 96.130 96.796 97.456
## Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.009 0.009 0.008 0.007 0.006
## % of var. 0.605 0.584 0.519 0.447 0.390
## Cumulative % of var. 98.060 98.644 99.163 99.610 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.580 1.246 0.174 | 0.155 0.098 0.012 | 0.052 0.013
## 2 | -0.376 0.522 0.108 | 0.293 0.350 0.066 | -0.164 0.127
## 3 | 0.083 0.026 0.004 | -0.155 0.099 0.015 | 0.122 0.071
## 4 | -0.569 1.196 0.236 | -0.273 0.304 0.054 | -0.019 0.002
## 5 | -0.145 0.078 0.020 | -0.142 0.083 0.019 | 0.002 0.000
## 6 | -0.676 1.693 0.272 | -0.284 0.330 0.048 | -0.021 0.002
## 7 | -0.191 0.135 0.027 | 0.020 0.002 0.000 | 0.141 0.095
## 8 | -0.043 0.007 0.001 | 0.108 0.047 0.009 | -0.089 0.038
## 9 | -0.027 0.003 0.000 | 0.267 0.291 0.049 | 0.341 0.553
## 10 | 0.205 0.155 0.028 | 0.366 0.546 0.089 | 0.281 0.374
## cos2
## 1 0.001 |
## 2 0.021 |
## 3 0.009 |
## 4 0.000 |
## 5 0.000 |
## 6 0.000 |
## 7 0.015 |
## 8 0.006 |
## 9 0.080 |
## 10 0.052 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.182 0.504 0.031 3.022 | 0.020 0.007 0.000 0.330 |
## Not.breakfast | -0.168 0.465 0.031 -3.022 | -0.018 0.006 0.000 -0.330 |
## Not.tea time | -0.556 4.286 0.240 -8.468 | 0.004 0.000 0.000 0.065 |
## tea time | 0.431 3.322 0.240 8.468 | -0.003 0.000 0.000 -0.065 |
## evening | 0.276 0.830 0.040 3.452 | -0.409 2.006 0.087 -5.109 |
## Not.evening | -0.144 0.434 0.040 -3.452 | 0.214 1.049 0.087 5.109 |
## lunch | 0.601 1.678 0.062 4.306 | -0.408 0.854 0.029 -2.924 |
## Not.lunch | -0.103 0.288 0.062 -4.306 | 0.070 0.147 0.029 2.924 |
## dinner | -1.105 2.709 0.092 -5.240 | -0.081 0.016 0.000 -0.386 |
## Not.dinner | 0.083 0.204 0.092 5.240 | 0.006 0.001 0.000 0.386 |
## Dim.3 ctr cos2 v.test
## breakfast -0.107 0.225 0.011 -1.784 |
## Not.breakfast 0.099 0.208 0.011 1.784 |
## Not.tea time 0.062 0.069 0.003 0.950 |
## tea time -0.048 0.054 0.003 -0.950 |
## evening 0.344 1.653 0.062 4.301 |
## Not.evening -0.180 0.864 0.062 -4.301 |
## lunch 0.240 0.343 0.010 1.719 |
## Not.lunch -0.041 0.059 0.010 -1.719 |
## dinner 0.796 1.805 0.048 3.777 |
## Not.dinner -0.060 0.136 0.048 -3.777 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.031 0.000 0.011 |
## tea.time | 0.240 0.000 0.003 |
## evening | 0.040 0.087 0.062 |
## lunch | 0.062 0.029 0.010 |
## dinner | 0.092 0.000 0.048 |
## always | 0.056 0.035 0.007 |
## home | 0.016 0.002 0.030 |
## work | 0.075 0.020 0.022 |
## tearoom | 0.321 0.019 0.031 |
## friends | 0.186 0.061 0.030 |
plot(mca, invisible=c("ind"), habillage="quali", graph.type = "ggplot")
plotellipses(mca, graph.type = c("ggplot"))
keep_b <- c("Sport", "diuretic", "frequency", "where", "how", "sugar", "slimming", "relaxing", "effect.on.health", "feminine", "sex", "age_Q")
tea_b <- dplyr::select(tea, all_of(keep_b))
mca_b <- MCA(tea_b, graph=F)
summary(mca_b)
##
## Call:
## MCA(X = tea_b, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.166 0.154 0.138 0.116 0.106 0.098 0.092
## % of var. 10.473 9.741 8.710 7.331 6.689 6.188 5.788
## Cumulative % of var. 10.473 20.214 28.923 36.255 42.944 49.132 54.920
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.086 0.077 0.076 0.069 0.065 0.064 0.058
## % of var. 5.451 4.870 4.819 4.385 4.099 4.042 3.678
## Cumulative % of var. 60.372 65.242 70.061 74.445 78.545 82.587 86.265
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19
## Variance 0.055 0.052 0.042 0.038 0.031
## % of var. 3.491 3.267 2.645 2.385 1.946
## Cumulative % of var. 89.757 93.024 95.669 98.054 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.571 0.655 0.224 | 0.093 0.019 0.006 | 0.055
## 2 | 0.161 0.052 0.023 | 0.023 0.001 0.000 | 0.159
## 3 | 0.248 0.124 0.065 | -0.165 0.058 0.028 | 0.026
## 4 | -0.639 0.822 0.379 | -0.317 0.217 0.093 | 0.069
## 5 | 0.013 0.000 0.000 | -0.004 0.000 0.000 | 0.115
## 6 | -0.559 0.627 0.312 | -0.232 0.117 0.054 | -0.036
## 7 | -0.321 0.207 0.056 | 0.094 0.019 0.005 | -0.083
## 8 | -0.060 0.007 0.002 | -0.380 0.312 0.088 | 0.147
## 9 | -0.084 0.014 0.004 | 0.297 0.191 0.055 | -0.774
## 10 | 0.254 0.129 0.040 | 0.293 0.186 0.053 | -0.612
## ctr cos2
## 1 0.007 0.002 |
## 2 0.061 0.023 |
## 3 0.002 0.001 |
## 4 0.011 0.004 |
## 5 0.032 0.013 |
## 6 0.003 0.001 |
## 7 0.017 0.004 |
## 8 0.053 0.013 |
## 9 1.448 0.375 |
## 10 0.906 0.233 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## Not.sportsman | 0.406 3.338 0.111 5.769 | -0.066 0.095
## sportsman | -0.274 2.256 0.111 -5.769 | 0.045 0.064
## diuretic | 0.406 4.808 0.228 8.253 | 0.039 0.048
## Not.diuretic | -0.561 6.640 0.228 -8.253 | -0.054 0.066
## 1/day | -0.367 2.146 0.062 -4.323 | 0.086 0.126
## 1 to 2/week | -0.698 3.595 0.084 -5.007 | -0.317 0.795
## +2/day | 0.466 4.621 0.159 6.905 | -0.092 0.194
## 3 to 6/week | 0.189 0.203 0.005 1.168 | 0.514 1.618
## chain store | -0.221 1.573 0.087 -5.099 | -0.449 6.958
## chain store+tea shop | 0.497 3.233 0.087 5.098 | 0.367 1.893
## cos2 v.test Dim.3 ctr cos2 v.test
## Not.sportsman 0.003 -0.940 | 0.223 1.214 0.034 3.172 |
## sportsman 0.003 0.940 | -0.151 0.820 0.034 -3.172 |
## diuretic 0.002 0.794 | 0.145 0.741 0.029 2.955 |
## Not.diuretic 0.002 -0.794 | -0.201 1.024 0.029 -2.955 |
## 1/day 0.003 1.010 | 0.091 0.158 0.004 1.069 |
## 1 to 2/week 0.017 -2.270 | 0.808 5.793 0.112 5.796 |
## +2/day 0.006 -1.364 | -0.289 2.140 0.061 -4.285 |
## 3 to 6/week 0.034 3.177 | -0.220 0.331 0.006 -1.358 |
## chain store 0.358 -10.342 | 0.299 3.461 0.159 6.898 |
## chain store+tea shop 0.047 3.762 | -1.114 19.493 0.436 -11.417 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Sport | 0.111 0.003 0.034 |
## diuretic | 0.228 0.002 0.029 |
## frequency | 0.210 0.051 0.139 |
## where | 0.097 0.531 0.476 |
## how | 0.012 0.516 0.596 |
## sugar | 0.288 0.020 0.002 |
## slimming | 0.104 0.028 0.146 |
## relaxing | 0.039 0.117 0.011 |
## effect.on.health | 0.001 0.012 0.065 |
## feminine | 0.252 0.103 0.038 |
plot(mca_b, invisible=c("ind"), habillage="quali", graph.type="ggplot")
plotellipses(mca_b, graph.type = c("ggplot"))